This report analyzes the 2014 financial data for the public school system in the US for elementary and secondary schools. More details of this raw data can be found in http://www.census.gov/govs/school/. We will also take a quick peek at the relationships (if any) between this data and the top 50 school districts to see how much of role a school’s finances play in a school’s ranking.
*Note: data for top 50 schools was manually created as there is a paywall for API access. Hence, only 50 schools. (Source: https://k12.niche.com/rankings/public-school-districts/best-overall/)
# Load the Data
schoolFinanceFull <- read.csv("elsec14.csv", stringsAsFactors = FALSE,
colClasses = c("SCHLEV"="character",
"STATE"="character",
"IDCENSUS" = "character"))
top_50 <- read.csv("top_schools.csv", stringsAsFactors = FALSE)
top_50$district <- toupper(top_50$district)
schoolFinanceFull <- left_join(schoolFinanceFull, top_50,
by=c("NAME"="district"))
schoolFinanceFull$isTop50 <- !is.na(schoolFinanceFull$rank)
#Data set cleanup
schoolFinanceFull$STATE <- as.factor(schoolFinanceFull$STATE)
schoolFinanceFull$SCHLEV <- as.factor(schoolFinanceFull$SCHLEV)
schoolFinanceFull$CSA <- as.factor(schoolFinanceFull$CSA)
schoolFinanceFull$CBSA <- as.factor(schoolFinanceFull$CBSA)
levels(schoolFinanceFull$SCHLEV) <- c("ELE", "SEC", "ELE-SEC", "VOC-SP",
"NOOP", "AGEN")
names(schoolFinanceFull)[names(schoolFinanceFull) == 'V33'] <- 'NUMSTUDENTS'
#We will change the state fips code to State abbreviation
data("state.regions")
schoolFinanceFull$STATE_FULL <- schoolFinanceFull$STATE
levels(schoolFinanceFull$STATE) <-
state.regions[order(state.regions$fips.numeric),]$abb
levels(schoolFinanceFull$STATE_FULL) <-
state.regions[order(state.regions$fips.numeric),]$region
schoolFinanceFull$GOVTYPE <- as.factor(substr(schoolFinanceFull$IDCENSUS, 3,3))
levels(schoolFinanceFull$GOVTYPE) <- c("STATE", "COUNTY", "CITY", "TOWNSHIP",
"INDEPENDENT")
We display the code to load the data here to show that the dataset we curated was an amalgamation of multiple datasets. Some data clean up was also required, as some values are coded as integers such as the school type (elementary, or secondary). Also the variable, IDCENSUS is an identification code for each dependent school system. It is made up of various other codes such as county codes and state codes. We already have the state and county information in other fields, so we extracted the “government type” information from this field and made it its own variable.
## [1] 14397 145
## CONUM STATE NUMSTUDENTS GOVTYPE
## 17031 : 160 CA :1053 Min. : 0 STATE : 31
## 6037 : 92 TX :1047 1st Qu.: 305 COUNTY : 451
## 34003 : 76 IL : 992 Median : 979 CITY : 211
## 36103 : 68 OH : 739 Mean : 3358 TOWNSHIP : 463
## 4013 : 58 NY : 680 3rd Qu.: 2746 INDEPENDENT:13241
## 25017 : 58 MI : 601 Max. :989012
## (Other):13885 (Other):9285
## SCHLEV TOTALREV TOTALEXP TFEDREV
## ELE : 2482 Min. : 0 Min. : 0 Min. : 0
## SEC : 429 1st Qu.: 5073 1st Qu.: 4965 1st Qu.: 285
## ELE-SEC:10470 Median : 13552 Median : 13231 Median : 835
## VOC-SP : 244 Mean : 43635 Mean : 43399 Mean : 3673
## NOOP : 39 3rd Qu.: 36575 3rd Qu.: 35675 3rd Qu.: 2412
## AGEN : 733 Max. :24215438 Max. :26009861 Max. :1814405
##
## TSTREV TLOCREV isTop50
## Min. : 0 Min. : 0 Mode :logical
## 1st Qu.: 2049 1st Qu.: 1759 FALSE:14347
## Median : 6295 Median : 5149 TRUE :50
## Mean : 20045 Mean : 19917 NA's :0
## 3rd Qu.: 15926 3rd Qu.: 15930
## Max. :9352527 Max. :13048506
##
Our final dataset consist of 14,397 observations, with 145 variables. Most of these variables will not be used for this analysis, as it is simply a breakdown of other variables. For example, school revenue (TOTALREV) is broken down to revenue from federal, state, and local sources (TFEDREV, TSTREV, and TLOCREV respectively). These in turn are broken down further. The codebook (http://www2.census.gov/govs/school/school14doc.pdf) contains details of each of variables.
Let’s start by looking at the distribution of revenue and expenditures of all school districts.
The revenue and expenditure distribution looks very similar in terms of the distribution curve and well as values of the distribution. This suggests that there could be a relationship, but it is too early to conclude at this point. Note that we removed the top 1 percent in terms of revenue and expenditures in the plots above. We can see below that the max of both total revenue and expenditure was far from the 3rd quantile suggesting that there are a couple of outliers in our dataset.
## TOTALREV TOTALEXP
## Min. : 0 Min. : 0
## 1st Qu.: 5073 1st Qu.: 4965
## Median : 13552 Median : 13231
## Mean : 43635 Mean : 43399
## 3rd Qu.: 36575 3rd Qu.: 35675
## Max. :24215438 Max. :26009861
Next, we look at the distribution of the number of students in each district.
It seems like most schools districts have under 1000 students. However, a good group of these have zero student.
nrow(schoolFinanceFull[schoolFinanceFull$NUMSTUDENTS==0,])
## [1] 884
Upon further inspection, many of the zero student school districts are “Nonoperating School System”, or “Educational Service Agency”. The rest could simply be incomplete data. We will keep this in mind in our analysis going forward.
nrow(subset(schoolFinanceFull, NUMSTUDENTS==0 & SCHLEV %in% c("NOOP","AGEN")))
## [1] 772
Replotting the distribution on number of students, we get:
Finally, let’s look the distribution of the different types of schools and the number of school districts within each state.
Surprisingly, most school districts and independent school systems and only a small number are state/country city/township dependent school systems. We also note that most school districts are a combination of elementary and secondary schools.
To get a better understanding of the distribution of the different school systems, we will remove the independent school system:
It looks like there are fewer that 50 state dependent school systems in the country and only around 1100 districts that are not independent school systems.
There are other features such as county data, and revenue breakdown that we will use in our bivariate and multivariate analysis.
We’ll start by looking at total revenue as compared to other parameters.
Not surprisingly the states with the highest revenues are the states that are also the highest in population (http://www.enchantedlearning.com/usa/states/population.shtml).
We can see that certain a small number counties make up most of the state’s total revenue. It is also interesting to see that is a band that cuts down right in the center of the country where the counties have very little school revenue, which from the state level map suggest that it’s an area of low population in the US, assuming that the revenue per student is even across the board. We’ll explore that idea later below.
For now, let’s take a quick look at the expenditures.
This looks very similar the the total revenue map…
A plot of revenue vs expenditure with the blue line being the break even point (revenue = expenditure) reveals that pretty much all schools spend everything they get in revenue, and in many cases the district spends more than they have. This relationship suggests that schools are run on a very tight budget and/or school districts are pressured to spend all of their fundings in order to maintain the same level of funding from year to year. It would be interesting to follow the school districts over time to see how their funding levels change.
Let’s switch gears for now and look at some trends in the top 50 school districts.
We can see that the most of the top 50 school districts are clustered in the northeastern corner of the country (33!), with New York State having 14 of the top 50 districts.
Although the map is a little difficult to read, we can still see that there is still some clustering effect even at the county level. Looking deeper into the county level of the top two states (New York, and Illinois)…
The clustering does not only happen with neighboring counties, but even in the counties themselves, with two neighboring counties in New York having 6 top 50 districts each.
Do these districts put more money per student as compared to other districts?
While the top counties are amongst the highest in terms of funding per student, we can see that there are other counties that are in the higher tier that do not have any top schools districts. This also shows that there is a disparity between counties on how much funding a school district gets per student.
Looking a total revenue compared to number of students:
The blue line on the plot a linear fit between revenue and number of student. Although the relationship does look linear, this is not a very strong relationship as we see from the above plots that there is quite a bit of variance in funding per student between counties.
We’ll breakdown the revenue and number of students relationship some more.
First by the school type:
Vocational school districts have a very high revenue to student ratio, whereas elementary, secondary, and elementary-secondary school districts seems to have very similar revenue to student ratio. Let’s look at these three school types specifically since vocational schools are quite different in nature compared to traditional elementary and secondary schools.
Even though most school districts are a combination of elementary and secondary schools, we can still see that on average, secondary school only districts get higher funding per student while elementary school only districts has less funding. We also marked with X’s the districts that are in the top 50 ranking and we can see that the revenue to student ratio is consistently higher than average.
Next, we breakdown the revenue to student relationship by government type.
Although there are not too many data points, it looks like school districts that are state dependent, city dependent, and township dependent are generally better funded than independent school district. Also, while most of the top 50 schools are independent school districts, there are a few top 50 school districts that are township dependent. There is not enough data to give any conclusions, but it would be interesting to see if township dependent school districts will have more top ranks if there are more ranking data.
We will now look at revenue at a different angle by breaking down total revenue to its federal and state constituents.
There is a lot to unpack here. First, we see that state funding is consistently higher than federal funding (except for District of Colombia where they do not have any state funding). We also noticed that federal funding is generally equal across states although there are a few exceptions. Since D.C does not have state funding, it had higher federal funding. Vermont, on the other hand have very low federal funding and the funding is mostly left to the state. Finally, we continued to mark the top 50 school district with X’s, and interestingly many of the top 50 school districts are not well funded at the federal level or state level. Many of the districts fall below the first quartile. But notice that in our previous plots, the top50 schools are generally well funded. So where does the districts revenue come from?
It turns out that most of the top 50 districts have the most funding per student from local sources compared to other districts. We note that the relationship between revenue and number of students is no longer linear. It is also interesting to note that elementary only school districts have more local revenue per student than elementary-secondary school districts.
There seems to be a clustering effect when states have multiple districts that are in the top 50 ranking, all the way down to the county level. However, we realize that there is only 50 data points and this finding is not conclusive. It would be worthwhile to collect more data on school district rankings to see if this clustering effect holds.
The plot above shows a counterintertuitive relationship between federal and state funding to top ranking schools. The top districts is consistently getting less revenue from federal and state sources, and is worth further investigation as to why this is the case. Is there actually a correlation between ranking and revenue allocation or are there other factors at play? We also note that the state of Vermont has much lower federal funding than all the other states, with the median being less than $100 per student.
This is the strongest relationship we have found throughout this analysis. Local revenue streams seems to affect ranking of school districts. Perhaps this is a reflection on the emphasis on education communities have. Local revenue could be a very good feature to use if we were to build a prediction model on top ranked school districts.
There were many uncertainties from the beginning when analyzing this data it was not a familiar subject. We were not sure if there were any trends and patterns to this data, but we were able to uncover some interesting relationships in our dataset like local revenue versus top 50 school districts and the clustering effect of top school districts. We were able to pull in multiple data sources and it was surprising that merging this data together was not as troublesome as we intially thought. One issue we found when looking at the clustering effect was that the county boundaries data was incomplete and one of the counties we needed in our visualization (Dupage county) was not in the dataset. Fortunately we were able to find the Dupage county boundary data from another source and we were able to proceed with our analysis.
Although we found some good relationships between a school district’s finances with respect to its rankings we did not build out a model as this data alone does not tell the whole story of how a school will be ranked. Demographics data and being able to normalize revenue and expenditure based on a cost of living index would very likely paint a much more complete picture of and would be worth further exploration.
Having this data could also allow us to explore other aspects of a school districts finances such as “where funding is allocated versus the demographics of an area? (Ex. school lunch programs)”. Finally, we noted earlier that this analysis only looked at the top 50 districts, whereas the ability to analysis perhaps the top 200 or even 1000 districts could reveal more patterns or trends.